Managing Language Resources and Tools using a Hierarchy of Annotation Schemas

Dan Cristea
Faculty of Computer Science, University “Al. I. Cuza” of Iași, Romania
Institute for Computer Science, Romanian Academy, Iași, Romania
dcristea@info.uaic.ro

Ionut Cristian Pistol
Faculty of Computer Science, University “Al. I. Cuza” of Iași, Romania
ipistol@info.uaic.ro

Abstract

This paper describes the concept and usage of ALPE (Automated Linguistic Processing Environment), a system designed to facilitate the management and deployment of large and dynamic collections of linguistic resources and tools. ALPE can build linguistic processing chains involving the annotation formats and the tools integrated into a hierarchical structure. The particularities and advantages of integrating ALPE in a project involving the development and usage of multiple linguistic resources are the main topics of this paper.

1 Introduction

Making sure that corpora, resources and tools are reusable in contexts other than that of the originating project is one of the recent main topics of interest in the Natural Language Processing community. Re-using a resource initially developed for a specific project usually fails for one of two reasons: either the resource is not sufficiently documented (the format is not known to the re-user), or the resource is not directly accessible (the location of the resource is not known to the re-user). Making sure a project’s results are well organized and accessible ensures a better impact and a longer lasting significance, as more people will be able to use the developed resources and tools.

One of the latest developments in NLP, and one which promises to have a significant impact on future linguistic processing systems, is the emergence of linguistic annotation meta-systems, which make use of existing processing tools and implement some sort of processing architecture, pipelined or otherwise.

In this paper we describe ALPE, a system offering a new perspective on the task of exploiting NLP meta-systems, by helping a community of users to have an integrated look at a whole range of tools that are able to communicate on the basis of common formats.

For annotated linguistic resources several standardization efforts have been made, such as XCES (www.xml-ces.org/) and TEI (www.tei-c.org/). However, the proposed standardizations are not universally accepted, and most research projects develop resources according to their own formats. More recent developments, such as GOLD (http://www.linguistics-ontology.org/gold.html), propose unification methods for the various annotation formats. With such methods one can easily transform the name space of a corpus in order to make it compatible with her/his own targets. Several systems have tried to facilitate the access to existing processing tools and to ease their usage. The more prominent ones are GATE (http://www.gate.ac.uk/) and UIMA (www.research.ibm.com/UIMA/). Both systems ease the access to a set of independently developed NLP tools which are already parts of an environment offering means to create and use processing chains intended to add linguistic metadata to an input corpus. GATE (Cunningham et al., 2002, Cunningham et al., 2003) is a versatile environment for building and deploying NLP software and resources, allowing for the integration of a large number of built-ins in new processing pipelines that receive as input a single document or corpus. UIMA (Ferrucci and Lally, 2004) offers the same general functionalities as GATE, but once a processing module is integrated in UIMA it can be used in any further chains without any modifications (GATE requires wrappers to be written to allow two new modules to be connected in a chain). Since the appearance of UIMA, the GATE developers have made available a module that allows GATE and UIMA processing modules to be interchangeable, basically merging the “pool” of modules available.

ALPE, a new NLP meta-system still in development, allows a user, even one with very limited programming capabilities, to automatically exploit already walked-on processing paths or to configure new ones on the spot, by exploiting the annotation schemas at intermediate steps. ALPE is based on the hierarchy of annotation schemas described in (Cristea and Butnariu, 2004). In this model, XML annotation schemas are nodes in a directed acyclic graph, and the hierarchical links are subsumption relations between schemas. (Cristea et al., 2006) describes how the graph may be augmented with processing power by marking edges linking parent nodes to daughter nodes with processors, each realising an elementary NLP step.

Section two of this paper presents the theory behind the ALPE system, and section three describes the significant features of ALPE, relevant in the context of a large scale research project employing multiple layers of annotation schemas and various tools. Section four makes a brief comparison between ALPE and the two most prominent NLP meta-systems (GATE and UIMA). The conclusions, as well as the further planned developments, are described in section five.

2 The Underlying Model

2.1 Linguistic Metadata Organised in a Hierarchy

We base our model on the directed acyclic graph (DAG) described in (Cristea and Butnariu, 2004), which configures the metadata of linguistic annotation in a hierarchy of XML schemas. Nodes of the graph are distinct XML annotation schemas, while edges are hierarchical relations between schemas. By interacting with the graph, a user can modify it from an initial trivial shape, which includes just one empty annotation schema, up to a huge graph accommodating a diversity of annotation and processing needs. If there is an oriented edge linking a node A with a node B in the hierarchy (we will also say that A subsumes B or that B is a descendant of A) then the following conditions hold simultaneously:
• any tag-name of A is also in B;
• any attribute in the list of attributes of a tag-name in A is also in the list of attributes of the same tag-name of B.

As such, a hierarchical relation between a node A and one descendant B describes B as an annotation schema which is more informative than A. In general, either B has at least one tag-name which is not in A, and/or there is at least one tag-name in B such that at least one attribute in its list of attributes is not in the list of attributes of the homonymous tag-name in A. We will agree to use the term path in this DAG with its meaning from the support graph, i.e. a path between the nodes A and B in the graph is the sequence of adjacent edges, irrespective of their orientation, which links nodes A and B. As we will see later, the way this graph is built makes it connected. This means that, if edges are seen as undirected, there is always at least one path linking any two nodes.

2.2 The Hierarchy Augmented with Processing Power

In NLP, the need for reusability of modules and for language and application independence imposes the reuse of specific modules in configurable architectures. In order for the modules to be interconnectable, their inputs and outputs must observe the constraints expressed as XML schemas.

When processes are placed on the edges of the graph of linguistic metadata, the hierarchy of annotation schemas becomes a graph of interconnecting modules. More precisely, if a node A is placed above a node B in the hierarchy, there should be a process which takes as input a file observing the restrictions imposed by the schema A and produces as output a file observing the restrictions imposed by the schema B.

In (Cristea et al., 2006) a graph (or hierarchy) of annotation schemas on which processing modules have been marked on edges is called augmented with processing power (or simply, augmented). The null process, marked Ø, is a module that leaves an input file unmodified.

2.3 Building the Hierarchy

Three hierarchy building operations are introduced in (Cristea et al., 2006): initialize-graph, classify-file and integrate-process. In this section we briefly present them.

The initialize-graph operation receives no input and outputs a trivial hierarchy formed by a ROOT node (representing the empty annotation schema). Once the graph is initialised, its nodes and edges are contributed by classifying documents in the hierarchy or manually.

The classify-file operation takes an existing hierarchy and a document marked with XML metadata and classifies the schema of the document within the hierarchy. The operation results in a (possibly) updated hierarchy and the location of the input schema as a node of the hierarchy. If the input document fully complies with a schema described by a node of the hierarchy, the latter remains unchanged and the output indicates the node found; otherwise a new node, corresponding to the annotation schema of the input document, is inserted in the proper place within the hierarchy.

Integrate-process is an operation aiming to properly attach processes to the edges of a hierarchy of annotation schemas, mainly by labelling edges with processors, but also by adding nodes and edges and labelling the connecting edges.

Apart from these basic operations that allow building a hierarchy from scratch or modifying an existing one by exploiting the annotation incorporated in files, a graphical interface allows the user to also define new nodes manually, which ALPE will place at the proper places in the hierarchy automatically. But building a hierarchy can be made independent of any explicit interaction with the system by a user. It is still not unusual that an interaction also results in an augmentation of an existing hierarchy with nodes corresponding to the user’s input and/or output file. Through multiple interactions, an initial minimal hierarchy which is accessed by a community of users can thus be developed.

2.4 Operations on the Augmented Graph

Three main operations are supported by the model of (Cristea et al., 2006): pipeline, simplification and merge.

If an edge linking a node A to a node B (therefore B being a descendant of A) is marked with a process p, it is said that A pipelines to B by p. Equally, when a file corresponding to the schema A is pipelined to B by p, it will be transformed by the process p into a file that corresponds to the restrictions imposed by the schema B. This results in augmenting the annotation of the input file (observing the restrictions of the schema A) with new information, as described by schema B.

For any two nodes A and B of the graph, such that B is a descendant of A, it is said that B can be simplified to A. When a file corresponding to the schema B is simplified to A, it will lose all annotations except those imposed by the schema A. Practically, a simplification is the opposite of a (series of) pipeline operation(s).

The merge operation can be defined in nodes pointed to by more than one edge of the hierarchical graph. It is not unusual that the edges pointing to the same node are labelled by empty processors. The merge operation applied to files corresponding to parent nodes combines the different annotations contributed by these nodes into one single file corresponding to the schema of the emerging node.

With these operations, the graph augmented with processing power is useful in two ways: for goal-driven, dynamic configuration of processing architectures and for transforming metadata attached to documents. Automatic configuration of a processing architecture is a result of a navigation process within the augmented graph between a start node and a destination node, the resulting processes being combinations of branching pipelines (serial simplifications, processing and merges). In terms of processing, the difference with respect to GATE and UIMA, both allowing only pipeline processing in which the whole output of the preceding processor is given as input to the next processor, is that in the described model the required processing may result in a combination of branching pipelines. This is due to the introduction of the merge operation, which is able to combine two different annotations on the same file. Once the process is computed, it can be applied to an input file displaying a certain metadata in order to produce an output file with the metadata changed as intended. These two files comply with the restrictions encoded by the start node and, respectively, the destination node of the hierarchy.

Since the graph is connected, there should always be at least one path connecting these two nodes. The paths found are made up of oriented edges and, depending on whether the orientation of the edges is the same as that of the path or not, we will have pipeline operations or simplification operations. A flow is a combination of paths between the start and the destination node that configures the processing which transforms any file observing the specifications of the start node (schema) into a file observing the specifications of the destination node (schema).

Once the entry and exit points in the hierarchy have been determined and processing flows (combinations of paths in the graph) have been devised, all the rest is done by the hierarchy augmented with the processing power in the manner described above. This way, the processing needed to arrive from the input to the output is computed by the hierarchy as sequences of serial and parallel processing steps, each of them supported in the hierarchy by means of specialized modules. Then the process itself is launched on the input file.

2.5 ALPE

ALPE is a system implementing the described model. Besides implementing all the previously described features, ALPE brings several additions.

The core modules

ALPE includes 11 core modules, used in any ALPE hierarchy (the hierarchy augmented with processing power, as described) but not attached to any edge. These core modules perform built-in tasks such as language identification, but also implement the basic operations in the hierarchy (among others, flow computation, merging and simplifying). These core modules are used in any ALPE hierarchy and are not replaceable by user tools. They ensure that any ALPE hierarchy implements the basic behaviour, as described in this paper.

[Figure 1: The ALPE core hierarchy (nodes: base, par, seg, form, tok, pos, lemma, chunks, morpho, wsd, sin, full)]

The core hierarchy

One of the main problems in developing a new NLP system is selecting a relevant and useful annotation format for the developed resources. Establishing a hierarchy of generally used XML metadata is not one of ALPE’s main purposes, but having most annotated documents adhere to some common format brings obvious benefits both to the developer of new NLP software and to the user, who would have an easier time finding the tools required for a particular annotation task.

As a base for any new ALPE hierarchy a core hierarchy is offered, with 12 annotation schemas ranging from a basic XML format to a full XCES (Ide et al., 2000) linguistic annotation specification (http://www.cs.vassar.edu/XCES/dtd/xcesAna.dtd). The intermediate formats are designed to conform to specific requirements for document annotation, such as tokenization, POS-tagging, NP-chunking, etc., as well as combinations of these markings. Figure 1 shows the ALPE core hierarchy. All nodes are subsets of the XCES standard for annotated data, and the subsumption relation is observed between all pairs of nodes linked through an edge.

The 12 nodes in Figure 1 correspond to XML annotation schemas as follows:
• base: subset of XCESAna including just cesAna tags, corresponding to a basic XML format;
• par: adds the par tag to the parent node, corresponding to an XML with marked paragraphs;
• seg: adds the s tag to the parent node, corresponding to an XML with marked sentences;
• form: a merge of the subsuming formats, corresponding to an XML with marked formatting (paragraphs and sentences) information;
• tok: adds the tok and orth tags to the parent node, corresponding to a tokenized text;
• pos: adds the ctag tag to the parent node, corresponding to a pos-tagged text;
• lemma: adds the base tag to the parent node, corresponding to a lemmatized text;
• chunks: adds the chunk and chunklist tags to the parent node, corresponding to a (Noun/Verb) phrase-chunked text;
• morpho: adds the msd tag to the parent node, corresponding to an XML displaying morphological metadata;
• wsd: adds a wsd tag for semantic disambiguation;
• sin: merges the parent nodes, corresponding to an XML displaying full syntactic information;
• full: merges all parent nodes.

The purpose of the core hierarchy is to offer both a starting point for any new hierarchy and anchors for any new linguistic annotation formats that a user would like to include. When the XML formats of the user’s input and output files are not identical with schemas belonging to the hierarchy (for instance, due to differences in the tags name space or to configurations of attributes that convey the same information in different ways) then the user has to provide convertors (wrappers) able to accommodate his notations with those corresponding to nodes of the hierarchy.

The user’s needs and the selection of flows

The ALPE augmented hierarchy can be used in many ways. Suppose a user wants to process an XML file from one input format to some output format. In principle, any such processing task involves a transformation by some module capable of receiving the input format and of outputting the required final format. The ALPE philosophy details such a processing task in relation to the pair of input-output schemas by establishing the way these schemas interrelate from the point of view of the subsumption relation. Two cases can be distinguished: either the two schemas observe a subsumption relation or they do not. When they do, the node corresponding to the input file can be connected through a direct descending or ascending edge to the one corresponding to the output file. It will be descending if the output schema results from the input schema through some additions, and it will be ascending if simplifications applied to the input are required in order to obtain the output. When the two schemas are not in a subsumption relation, then there should be a node such that either both are subsumed by it, or both subsume it.

ALPE comes with a core hierarchy whose nodes act as a grid of fixed benchmarks with respect to which the locations of the input and output schemas are set out. When the pair of the user’s schemas matches two nodes of the core hierarchy, then processing can be drawn in terms of known (built-in) interconnected modules. When a match (modulo, as noticed above, the XML elements name space and/or differences in configurations of attributes still conveying the same information) of one or even both of the user’s schemas against nodes of the hierarchy is not possible, then the non-matching schemas should be seen as new nodes of the hierarchy. In this case it is the user’s responsibility to also locate the processes which will be assigned to the new edges which will connect the new nodes to the hierarchy.

ALPE designs a solution to the user’s problem by first computing all possible chains of edges which link the input schema to the output schema and, if needed, executing them.

Each computed flow is characterized by a set of features. These features include properties such as: flow length (defined as the number of processing steps involved), cost (for instance, if processing involving one or more modules presupposes financial costs), the estimated precision of execution, and the estimated time of execution. The user can then select and run the flow most suitable to his needs.

3 Features

In this section we will describe a set of features implemented in ALPE that are often wished for in environments working with linguistic resources and tools. We will see how these features emerge from the model described above. Many of these features are key elements of the future European linguistic infrastructure, as seen by CLARIN (http://www.clarin.eu).

Multilinguality

In modern NLP, algorithms are separated from linguistic details. This way, a module designed to perform a specific task can be put to work on any language if fuelled with the appropriate language resources. This is the case, for instance, with POS-taggers (see, for instance, TnT (Brants, 2000)), which are powered by specific language models (frequency of n-grams of POS tags). A syntactic parser should be powered by the grammar of a language to be effective in parsing sentences of that language. A shallow parser, which usually implements an abstract automata machinery, could recognize noun phrases of one language if powered by a resource consisting of a set of regular expressions specific to that language.

To implement multilinguality within the proposed model means to map the edges of the augmented graph on a collection of repositories of configuring resources (language models, sets of grammar rules, regular expressions, etc.) which are specific to different languages. This can be achieved if the edges of the graph labelled with processes are indexed with indices corresponding to languages. This way, for each particular language an instance of the graph can be generated, in which all edges keep one and the same index: the one corresponding to that particular language. This means that all processors of that particular language should access the configuring resources specific to that language in order for the hierarchy to work properly. For instance, in the graph instance of language Lx, the edge corresponding to a POS-tagger has as index Lx, meaning that it accesses a configuring resource file that is specific to language Lx (that language model).

It is a fact that different languages have different sets of processing tools developed, English being perhaps the richest at present. Ideally, the blame for the lack of a tool in a specific language should be put on the lack of the corresponding configuring resource, once a language independent processing module is available for that task. It is also the case that differences exist in processing chains among languages. For instance, one language could have a combined POS-tagger and lemmatizer while another one realizes these operations independently, pipelining a POS-tagger with a lemmatization module. These differences are reflected in particular instances of sections of the graph which, although they reproduce the same set of nodes, allow only certain edges to link them. The missing edges inhibit pipelining operations along them, but are suited for simplification operations.

[Figure 2: Computation of different flows for specific languages (panels: L1+L2+L3, L1, L2, L3; nodes: tok, POS, lemma, POS+lemma; tools: POS tagger, Lemmatizer, POS tagger+lemmatizer, Ø)]

Figure 2 gives a simple example of how ALPE handles multiple languages integrated in the same hierarchy. The first hierarchy (marked as L1+L2+L3 in the figure) has four nodes (annotation schemas):
• tok: XML which marks lexical tokens;
• POS: XML marking tokens and their part-of-speech;
• lemma: XML marking tokens and their lemmas;
• POS+lemma: XML with tokens, POS and lemma information.

These four nodes correspond to simple processing stages for linguistically annotated documents. The ALPE hierarchy fragment representation (shown in the L1+L2+L3 section of Figure 2) indicates the subsuming relations between the respective nodes and the attached tools. For each tool, the languages for which the tool is available are indicated in parentheses. The sections of Figure 2 marked L1, L2 and L3, respectively, sketch the corresponding instantiations of this sub-hierarchy for the three languages.

The user can provide an input document (XML with marked lexical tokens) and specify the required output format as being the final node (suppose POS+lemma). ALPE determines the language of the input document (as being L1, L2 or L3). If the input document belongs to the language L1, the computed flow will include only tools available for that language. Thus the only possible flow will use the POS tagger and the Lemmatizer tools, then merge their results into the output format. For the second language the flow will use a different POS tagger tool, one that requires as input a file corresponding to the lemma node. So the computed flow will first run the Lemmatizer, then the POS tagger on the result. For the third language, a tool is available that can directly annotate an input file in the tok format up to the required output.

We can look at the ALPE hierarchy as having three layers, one for each language. The three language specific hierarchies can look completely different for each language, but are still able to compute and run the same flows as the combining hierarchy. The three layers are aligned by nodes which display the same XML structure.

Manual versus automatic annotation

We have seen how automatic annotation is supported by the augmented graph. But how can manual annotation be accommodated within this approach?

Usually, in order to train processing modules in NLP, developers use manually annotated corpora. To create such corpora, they make use of annotation tools configured to help place XML elements over a text, and to decorate them with attributes and values. As such, if annotation tools do, although in a different way, the same jobs which can be performed by processing modules, it is most convenient to associate them with edges in the graph in the same way in which processing modules are associated with these edges.

Meanwhile, it is clear that manual annotation cannot be chained in complex processing architectures in the same way in which automatic annotation can. In order to differentiate between automatic and manual processes, as encumbered by pairs of schemas observing the subsumption relation, it results that edges should have facets, for instance AUT and MAN. Under the AUT facet of a POS-tagging edge, for instance, the automatic POS-tagger should be placed, while under the MAN facet the POS-tagging annotation tool should be placed.

The configuration files of these tools can usually be separated from the tools themselves. We can say that the corresponding configuration files particularise the annotation tools, which label edges of the graph, in the same way in which language specific resources particularise processing modules.

IPR and cost issues

Intellectual property rights can be attached to documents and modules as access rights. Only a user whose profile corresponds to the IPR profile of a resource/tool can have access to that file/service. As a result, while computation of processing chains within the hierarchy is open to anybody, the actual access to the dynamically computed architectures could be denied to users who do not correspond to certain IPR profiles of certain component modules or resources they need.

More than that, some price policies can be easily implemented within the model. For instance, one can imagine that the computation of a flow also results in the computation of a price, depending on the particular fees the chained Web servers charge for their services.

It follows that the graph can also be imagined as including more than one edge between the same two nodes in the hierarchy. This can happen when different modules performing the same task are reported by different contributors. When these modules charge fees for their services, an optimization calculus with respect to the overall price over the set of paths that can be computed for a required processing is also foreseeable.

Facing the diversity of annotation styles

It is a fact that, today, a huge diversity of annotation variants circulates and is being used in diverse research communities. It is far from us to believe that a Procrustean Bed policy, one that would aim for a strict adoption of standards for the annotated resources, could ever be imposed in the CL or NLP community. On the other hand, it is also true that efforts towards standardization are continually being made (see the TEI, XCES, ISLE, etc. initiatives). Moreover, the Semantic Web, with its tremendous need for interconnection and integration of resources and applications in communicating environments, vividly boosts the appeal of standardization. It is therefore foreseeable that more and more designers will adopt recognized standards, in order to allow easy interoperability of their applications. A realistic view on the matter would bring the standards into focus while also providing means for users to interact with the system even if they do not rigorously comply with the standards.

We have seen already that, by classification, any schema could be placed in the hierarchy. Of course, classification could increase in an uncontrollable way the number of nodes of the hierarchy. The proliferation could be caused not so much by the semantic diversity of the annotations, as by the differences in name spaces (names of tags and attributes).

Technically, this can be achieved by temporarily creating links between the new schema classified by the hierarchy, as a new node, and its corresponding schema in the hierarchy. Processing along such a link is different from the usual behaviour associated with the edges of the graph and is specific to wrappers. It describes a translation process, in which the annotation is not enriched, but rather names of XML elements and attributes are changed.

Ideally, the processing abilities of the hierarchy should also include the capability to automatically discover wrapping procedures. This task is not trivial since it would require that the hierarchy “understands” the intentions hidden behind the annotation, displaying, this way, some kind of semantic processing capabilities, which is not easy to implement. However, recent initiatives such as GOLD make us believe that significant steps forward in this direction are near.

4 Evaluation

4.1 ALPE vs GATE and UIMA

In this section we will compare functionalities of ALPE with those of GATE and UIMA, systems which can give results very similar to ours.

First of all, ALPE is intended primarily to facilitate the user’s interaction with the system, allowing a non-expert in programming to integrate resources and tools. As a standalone linguistic processing environment, ALPE presents the user with a visual representation of a hierarchy of annotation formats, and the user has basically three main choices: s/he can add a new resource to the hierarchy (for example enabling an already integrated processing module to work for another language by adding a corresponding language model), add a new processing tool (attached to an existing edge, or attached to a newly created edge) or compute and use a processing chain (providing the input file and selecting the output format). GATE offers a user interface adequate for creating and using processing chains. Chains have to be built manually and presuppose an intimate knowledge of the system. UIMA is even more oriented to the NLP professional, offering little in terms of visual user interaction. A direct comparison based on quantitative evaluations is difficult to make for these kinds of systems. Perhaps a better prospect would be a qualitative comparison performed by a significant pool of users, providers as well as consumers of language resources and tools. In the following, we make just an estimative comparison, but a qualitative evaluation versus human performance is planned.

Every one of the three main functionalities (adding a new resource, adding a new tool, and computing and using a processing chain) is easier to perform in ALPE. Both UIMA and GATE require some formal description to be written for each new resource integrated into the system, while ALPE generates these formal descriptions automatically. When adding a new processing tool, ALPE has much more permissive restrictions with regard to what tool can be integrated: it basically has to be either a webservice or a command line, executable under Windows or Linux. GATE allows the user to integrate at least Java and Perl based tools, and this is done by writing some dedicated code, a task which is however above the capabilities of some users. UIMA is even more restrictive, allowing only C++ based tools to be integrated, and only after significant implementations and changes to the original code. However, an extension allowing modified Perl, Python and TCL modules to be integrated is available.

An evident advantage of ALPE over both GATE and UIMA is that the processing chains in ALPE are automatically computed, therefore requiring no human intervention. Moreover, they can be created between any two formats defined in the hierarchy (provided the modules decorating the connecting edges are available, otherwise they are signalled as missing). ALPE deals with multilinguality thanks to its core module that performs language identification for each input file, then selects the corresponding tools and language resources, if available. GATE and UIMA are mainly focused on English (GATE also incorporating modules dedicated to some other languages), but the user has to make sure to select the proper modules when designing a processing chain for a document in a language other than English.

Let us consider the example of a use-case in which the user has two processing tools s/he wants to use on the same input file and to merge the results in an output file. Using ALPE, this user has to specify the input/output formats of the modules, then let the system integrate the tools as arcs linking the corresponding nodes in the hierarchy (in the case when one or both of these formats are not currently part of the hierarchy, they will become part of it), then input the file and specify the required output format (node). Using GATE, the user has to implement the integration of the tools to make them available to the processing chain building interface, then build and run two processing chains, one for each tool, then merge the results outside GATE (since it does not allow parallel processing and merging of annotations). UIMA performs this task basically in the same way as GATE, requiring even more implementation when integrating the new tools, but allows annotation merging.

4.2 Qualitative evaluation

In order to evaluate ALPE versus human computational linguistic specialists, we have developed an ALPE augmented hierarchy configured for a current research project (LT4eL, an FP6 project, www.lt4el.eu) involving documents in 9 European languages (Bulgarian, Czech, English, German, Dutch, Maltese, Polish, Portuguese and Romanian) and using a significant number of language processing tools. All documents have to be annotated according to 6 main annotation formats (and 8 optional ones), resulting in a significant hierarchy of standards. This hierarchy is already implemented and serves as a management and access facility for the collected documents.

At the time of writing this paper, an ALPE core hierarchy specific to the mentioned project is implemented for English and Romanian.

5 Conclusions

We think that the model we propose and its first implementation, as the ALPE system, encapsulate different organisational, standardisation and processing features which make it interesting for the goals of a project like CLARIN.

In this proposal we have been concerned with the following features of functionality, also identified as of primary importance in CLARIN:
• unique access gate and distributivity: although distributed in different places, LR and LT could be, in the vision described in this paper, identified through a single access gate;
• metadata policy: primary text and speech documents should be given the possibility to be accompanied by metadata describing human and/or automatic annotation over them. The ALPE conventions allow for the metadata to have a form which makes it easily removable when the primary raw documents need to be recovered;
• independence of representation: it is clear that the XML representation adopted by ALPE allows for LR to be manipulated in such a way as to benefit from the same treatment irrespective of the particular metadata conventions;
• quick access: ALPE comes very close to the objective that CLARIN LR and LT be accessed instantaneously from all over Europe;
• conversion services: the ALPE approach incorporates features that allow easy conversion operations from and onto different representations;
• processing services: the ALPE portal provides processing services for enrichment and/or simplification of metadata.

[...] require a fee to be paid before usage. Each user will be able to contribute his own tools and annotated resources, as well as to use processing chains adapted to his specifications, both in terms of input and output formats and in terms of cost and performance issues.

Acknowledgments

Part of the work for the paper was supported by the ROTEL (CEEX project) AMCSIT contract no. 29/03.10.2005, the CLARIN INFRA-2007-2.2.1.2 project, and the FP6 LT4eL project.

References

T. Brants (2000): TnT: a statistical part-of-speech tagger. In Proceedings of the sixth conference on Applied Natural Language Processing, Seattle, Washington, pp. 224-231.

D. Cristea, C. Butnariu (2004): Hierarchical XML representation for heavily annotated corpora. In Proceedings of the LREC 2004 Workshop on XML-Based Richly Annotated Corpora, Lisbon, Portugal.
attached to LR; D Cristea, C Forăscu, I Pistol (2006): • versioning: the portal allows manipulation of Requirements-Driven Automatic Configuration of different versions of data as well as of the metadata Natural Language Applications In Bernadette Sharp accompanying the texts; (Ed ): Proceedings of the 3rd International Workshop • multilinguality: the structure allows uniform on Natural Language Understanding and Cognitive treatment of documents in different languages, as Science - NLUCS 2006, in conjunction with ICEIS well as of parallel texts; 2006, Cyprus, Paphos, May 2006 INSTICC Press, • IPR issues: the structure provides means of dealing Portugal ISBN: 972-8865-50-3 with IPR H Cunningham, D Maynard, K Bontcheva, V Tablan In this paper we have described a model of dynamical (2002): GATE: A framework and graphical building of processing architectures based on a hierarchy of XML schemas and an implementation – called ALPE development environment for robust NLP tools and We have argued that ALPE brings some advantages over applications In Proceedings of the 40th Anniversary other known systems with similar objectives, mainly Meeting of the ACL (ACL’02) Philadelphia, US coming from a plus in manoeuvrability and complete H Cunningham, V Tablan, K Bontcheva, M Dimitrov automation of the configuring tasks It is also shown how (2003): Language engineering tools for collaborative ALPE, has brought already significant advantages in the corpus annotation Proceedings of Corpus Linguistics context of a multilingual research project In this context 2003, Lancaster, UK ALPE has automatically configured complex processing D Ferrucci and A Lally (2004): UIMA: an architectural chains involving several modules and documents in approach to unstructured information processing in the different languages The features brought by the addition corporate research environment, Natural Language of an ALPE type hierarchy into a complex project Engineering 10, No 3-4, 327-348 contribute 
significantly to acquire multilinguality, distributivity, versioning of language resources, automatic N Ide, Bonhomme P , Romary L (2000) : XCES: An and manual annotation, management of IPR and cost XML-based Encoding Standard for Linguistic Corpora, issues, as well as managing diversity of annotation styles, Proceedings of the Second International Language features that the CLARIN project considers of extreme Resources and Evaluation Conference Paris: European importance Language Resources Association One important further development of ALPE will be a web-service allowing users to build, configure and use ALPE hierarchies on the web, either as a limited password-protected resource or a global linguistic resources collection This type of hierarchy is able to manage multilingual resources as well as resources which 9 We foresee that other requirements, as, for instance, discovery of resources and tools, preservation of resources, archiving services, content discovery, distribution, authentication and authorization, could also be designed around the structure we propose 
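As an illustration of the automatic chain computation discussed in Section 4.1, the following minimal sketch treats the hierarchy of annotation formats as a directed graph whose edges are labelled with processing tools. It is not ALPE's actual implementation: the format names, the tool names, the `EDGES` table and the choice of breadth-first search for a shortest chain are all assumptions made for the example. An edge labelled `None` stands for a connection whose module is not yet integrated and is reported as missing.

```python
from collections import deque

# Hypothetical hierarchy: (input format, output format) -> tool name.
# None marks an edge whose processing module is not available.
EDGES = {
    ("raw", "tokens"): "tokenizer",
    ("tokens", "pos"): "pos_tagger",
    ("tokens", "sentences"): "sentence_splitter",
    ("pos", "np_chunks"): None,
}


def compute_chain(edges, source, target):
    """Return the (from, to, tool) steps of a shortest chain from source
    to target, or None if the target format cannot be reached."""
    successors = {}
    for (src, dst), tool in edges.items():
        successors.setdefault(src, []).append((dst, tool))

    # Breadth-first search, remembering how each format was reached.
    parent = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            break
        for dst, tool in successors.get(node, []):
            if dst not in parent:
                parent[dst] = (node, tool)
                queue.append(dst)

    if target not in parent:
        return None

    # Walk back from the target to rebuild the chain in processing order.
    steps = []
    node = target
    while parent[node] is not None:
        prev, tool = parent[node]
        steps.append((prev, node, tool))
        node = prev
    steps.reverse()
    return steps


def missing_modules(chain):
    """List the edges of a chain that have no module attached."""
    return [(src, dst) for src, dst, tool in chain if tool is None]
```

With this toy hierarchy, `compute_chain(EDGES, "raw", "pos")` yields the tokenizer step followed by the tagger step, while `missing_modules(compute_chain(EDGES, "raw", "np_chunks"))` reports the edge from `pos` to `np_chunks` as lacking a module. Merging the output of parallel chains, as in the two-tool use case, is not covered by this sketch.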